feat(filesystem): add PageIndex FileSystem and PIFS CLI #302
Remove the synchronous=OFF pragma from PIFS catalog inserts so SQLite remains the durable source of truth.
Route default semantic search to the summary projection when summary is the only populated semantic channel.
Only use the fresh event loop fallback for missing running-loop detection, so RuntimeError from a threaded agent run is not retried.
Merge the unified browse command implementation into feat/pageindex-filesystem.
Merge stable key-value browse output into feat/pageindex-filesystem.
Merge removal of legacy semantic commands into feat/pageindex-filesystem.
Merge ask/chat retrieval strategy updates into feat/pageindex-filesystem.
Merge embedding dimension defaults and mismatch guards into feat/pageindex-filesystem.
Merge pifs add command and atomic import handling into feat/pageindex-filesystem.
Return nested PageIndex structure JSON from cat --structure and keep content reads page-based only. Remove the cat --node command surface, related limits, prompts, and structure-text fallback.
* feat(filesystem): add pifs semantic folder build
* fix(filesystem): preserve semantic folder command paths
* fix(filesystem): retry semantic folder planning
* fix(filesystem): balance semantic folder planner guidance
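The event-loop commit above narrows the fresh-loop fallback to the loop-detection step. A minimal sketch of that pattern follows, assuming a hypothetical `run_blocking` helper (not the PR's actual function). Only the `asyncio.get_running_loop()` probe sits inside the `try`, so a `RuntimeError` raised by the coroutine itself propagates instead of triggering a second run. What the PR does when a loop is already running is not shown here; this sketch simply refuses.

```python
import asyncio


def run_blocking(make_coro):
    """Run a coroutine to completion from synchronous code.

    `make_coro` is a zero-argument callable returning a fresh coroutine.
    Hypothetical helper for illustration, not the PR's API.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so a fresh loop is safe.
        loop_running = False
    else:
        loop_running = True

    if loop_running:
        # Assumption: refuse rather than guess; the caller should await.
        raise RuntimeError("run_blocking() called from a running event loop")

    # Outside the try block: errors raised by the coroutine are not retried.
    return asyncio.run(make_coro())


calls = []


async def failing_agent_run():
    calls.append("run")
    raise RuntimeError("agent run failed")


async def ok():
    return 42
```

Calling `run_blocking(failing_agent_run)` raises the coroutine's own `RuntimeError` after a single attempt, because the failure happens outside the narrowed `try`.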
KylinMountain
left a comment
Automated deep-review summary (recall-oriented)
Machine-assisted review (multi-agent finder + adversarial verification passes) of the PIFS change, optimized for recall of real defects. Each inline finding was independently verified against the PR-head code. Severity tags: High = breaks normal usage / crashes / silent wrong results; Med = real bug with a narrower trigger.
Positioning question first (see the requirements.txt comment)
PIFS introduces sqlite-vec + an OpenAI embedding pipeline, and pifs browse performs vector similarity search, which sits in tension with the README's "Vectorless / No Vector DB" positioning (L14 / L16 / L65 / L67). Flagging for an explicit product + docs decision, not as a code bug.
Highest-impact correctness issues (inlined below)
- Malformed command (`find /docs --where`) → uncaught `IndexError` crashes the agent turn / CLI.
- Read-only commands (`ls`/`cat`/`find`...) require an embedding key once content is indexed.
- Stopword-only query (`--name "the"`) silently returns `[]`.
- One ambiguous title aborts the whole `browse`.
- Metadata filters: numeric `$eq` int/float mismatch, `$gt`/`$lt` excludes text-stored numerics, large-int float precision loss (three distinct holes in the same area).
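The metadata-filter holes in the last item can be illustrated with general SQLite and Python behavior. The sketch below does not reproduce the PR's filter code; it only shows, under the assumption that values reach SQLite without column affinity (for example as bound parameters), how text-stored numerics and float coercion can produce results like those described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def scalar(sql, params=()):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql, params).fetchone()[0]


# 1) Without affinity, any TEXT value sorts above any number,
#    so a text-stored "5" is never "< 10" unless it is cast first.
text_lt = scalar("SELECT ? < ?", ("5", 10))
cast_lt = scalar("SELECT CAST(? AS REAL) < ?", ("5", 10))

# 2) TEXT '5' never equals INTEGER 5, while INTEGER 5 and REAL 5.0
#    compare equal numerically. Compared as strings, 5 and 5.0 differ.
text_eq = scalar("SELECT ? = ?", ("5", 5))
num_eq = scalar("SELECT ? = ?", (5, 5.0))
str_eq = str(5) == str(5.0)

# 3) Ints above 2**53 collide once coerced to float (53-bit significand).
big = 2**53 + 1
collides = float(big) == float(2**53)

print(text_lt, cast_lt, text_eq, num_eq, str_eq, collides)
```

A filter layer that normalizes both sides to a numeric type before comparing, and keeps large integers as integers, would avoid all three cases.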
Investigated but NOT flagged (verified safe)
SQLiteSession thread / shared-history concerns (SDK uses check_same_thread=False + per-instance in-memory DB); decode_vector dimension (index-level search/upsert already validate); __init__.py only swallows ModuleNotFoundError for the 4 optional deps with a re-raise guard; add_file readiness guards not reused on register (intentional deferred-metadata design); the folder-vs-file depth -1 asymmetry (actually correct — a file is one tree level below its folder).
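For context on the vector-dimension point, here is a minimal sketch of what a decode-time dimension guard could look like. It assumes vectors are packed as little-endian float32 blobs; the function names `encode_f32` and `decode_f32` are hypothetical and are not the PR's `decode_vector`.

```python
import struct


def encode_f32(values):
    """Pack a sequence of floats as a little-endian float32 blob."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_f32(blob, expected_dim):
    """Unpack a float32 blob, rejecting blobs of the wrong dimension."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length is not a multiple of 4 bytes")
    dim = len(blob) // 4
    if dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {dim}")
    return list(struct.unpack(f"<{dim}f", blob))


blob = encode_f32([0.5, -1.0, 2.0])
vector = decode_f32(blob, expected_dim=3)
```

If validation already happens at the index level, as the review found, a decode-level check like this would be redundant rather than required.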
Lower-severity (not inlined, for completeness)
cat --range past EOF → nonsensical start>end empty result; the pagination "next" command can point past document EOF; SQLiteVecSemanticIndex.reset() uses auto-committing executescript with no rollback (recoverable — rebuildable index); with self.connect() commits but never closes the connection (bounded by CPython GC); entity/relation channels are declared & plumbed but never populated (dormant — hidden from capabilities & rejected up front, only over-advertised in static help). The default-model change to gpt-5.4 looks intentional (retrieve_model was already gpt-5.4).
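The connection-lifetime note above reflects standard `sqlite3` behavior: using a connection as a context manager commits or rolls back the transaction but does not close the connection. A minimal sketch follows, with hypothetical function names and an in-memory database standing in for the PIFS catalog.

```python
import sqlite3
from contextlib import closing


def write_commit_only():
    """`with conn:` commits on success but leaves the connection open."""
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
    return conn


def write_and_close():
    """Wrapping in closing() also closes the connection on exit."""
    with closing(sqlite3.connect(":memory:")) as conn:
        with conn:
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
    return conn


open_conn = write_commit_only()
row_count = open_conn.execute("SELECT count(*) FROM t").fetchone()[0]
open_conn.close()

closed_conn = write_and_close()
```

After `write_and_close()`, any further `execute` on the returned connection raises `sqlite3.ProgrammingError`, which confirms the handle was released deterministically rather than left to garbage collection.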
There are also several cleanup/DRY opportunities (OpenAI-client construction duplicated 3x; normalize_path / JSON-coercion / textwrap.shorten reimplemented; MetadataGenerator lazy-init copy-pasted 3x; per-file LLM/embedding calls that could batch) — happy to file separately if useful.
Generated with assistance; treat as input, not gospel — verify before acting.
requirements.txt:
PyPDF2==3.0.1
python-dotenv==1.2.2
pyyaml==6.0.2
sqlite-vec>=0.1.9
[Positioning / architecture — for discussion] New hard dependency sqlite-vec + embedding-based similarity search. PIFS adds sqlite-vec and an OpenAI embedding pipeline (semantic_projection.py / semantic_index.py), and pifs browse performs vector similarity search. That's in tension with PageIndex's headline positioning — the README states "Vectorless, Reasoning-based RAG" and "No Vector DB ... instead of vector similarity search" (README L14 / L16 / L65 / L67).
Worth an explicit product call: is the semantic projection an intentional, clearly-scoped file-location aid (with reasoning-based structural retrieval still primary), and should that be documented so it doesn't contradict the "vectorless" claim? It also makes sqlite-vec a required dependency.
Not changing this in the correctness-fix commit. The code fixes above keep semantic projection scoped to PIFS browse as a ranked file-location aid; the semantic backend returns candidate document ids for catalog resolution, while evidence still comes from bounded cat/grep/stat/PageIndex reads. The README/product wording needs an explicit docs decision rather than a silent dependency edit in this bugfix pass, so I am leaving this as positioning context for the PR discussion.
Summary
This PR adds PageIndex FileSystem (PIFS): a filesystem-like interaction system for agents working inside a PageIndex workspace, plus a `pifs` CLI and an `ask`/`chat` loop built on the same command surface.
The core purpose is to help an agent quickly locate the right file in a workspace, then combine that filesystem context with PageIndex structure, metadata, and projection indexes to retrieve precise file evidence.
Goal
PIFS gives agents a stable filesystem-like interface to PageIndex workspaces:
`browse` is part of the file-location step. It ranks file candidates within a folder scope when folder names and exact filters are not enough. Evidence still comes from bounded `cat`, `grep`, and PageIndex structural reads.
What Changed
- `pifs`, a shell-style CLI for workspace navigation, file discovery, metadata filtering, source reads, imports, and agent execution.
- `pifs add` for atomic local imports into workspace-owned artifacts.
- `pifs ask` and `pifs chat`, where the agent uses the same read-only filesystem commands available to users.
- `pifs semantic-folder build [source_scope]` materializes a generated `<source_scope>/semantic` tree from canonicalized `domain`/`topic` metadata.
- `grep -R`, path ambiguity, projection dimension mismatches, atomic import cleanup, and semantic-folder rebuild safety.
Command Surface
- `--workspace`, `--env-file`, `--json`
- `pifs set workspace <path>`
- `pifs ls`, `pifs tree`, `pifs find`, `pifs stat`
- `pifs browse [-R] <folder> "<query>" [--space summary|entity|relation] [--where JSON] [--page N]`
- `pifs cat <path> --structure|--page|--range|--all`, `pifs grep [-R] <pattern> <path>`
- `pifs add <physical_path> <virtual_path>`, `pifs semantic-folder build [source_scope]`
- `pifs ask "<question>"`, `pifs chat`
The agent command surface intentionally exposes only read/navigation commands: `ls`, `tree`, `find`, `browse`, `grep`, `cat`, and `stat`. It can use an existing semantic folder like any other tree, but it cannot build one.
Key Files
- `pageindex/filesystem/core.py`: high-level PIFS API, registration flow, metadata generation, projection wiring, semantic-folder build orchestration, and browse behavior.
- `pageindex/filesystem/store.py`: SQLite workspace catalog for folders, files, metadata, generated memberships, and PageIndex/projection state.
- `pageindex/filesystem/commands.py`: command parser, executor, shell rendering, capabilities, and guardrail messages.
- `pageindex/filesystem/agent.py`: `ask`/`chat` policy and streaming loop over the PIFS command surface.
- `pageindex/filesystem/semantic_folder.py`: Semantic Folder planner contract, OpenAI planner, plan schema, and validation rules.
- `pageindex/filesystem/semantic_projection.py` and `semantic_index.py`: summary projection indexing and vector search adapter used by `browse`.
- `pageindex/filesystem/metadata.py` and `metadata_generation.py`: metadata schema, policy, status, and generated metadata helpers.
- `pageindex/filesystem/cli.py` and `pifs`: CLI entrypoints.
- `examples/pifs_demo.py`: local end-to-end demo over example documents.
Verification
- `uv run pytest tests/test_filesystem_store.py tests/test_import_surface.py tests/test_metadata_generation.py tests/test_pageindex_filesystem_scope.py tests/test_pageindex_structural_read.py tests/test_pifs_add_command.py tests/test_pifs_agent_stream.py tests/test_pifs_cli.py tests/test_pifs_find_maxdepth.py tests/test_pifs_like_escape.py tests/test_pifs_path_resolution.py tests/test_pifs_register_side_effects.py tests/test_pifs_semantic_folder.py tests/test_semantic_index.py`
- `/SEC_Filings_LTM/semantic` built as `topic/domain` with 82 files, 82 memberships, and 0 skipped files.
- `/33capital/semantic` built as `topic/domain` with 18 files, 18 memberships, and 0 skipped files.